Abstract. Tree automata based algorithms are essential in many fields
of computer science such as verification, specification, and program
analysis. They have also become essential for databases with the development
of standards such as XML. In this paper, we define new classes of
non-deterministic tree automata, namely residual finite tree automata (RFTA).
In the bottom-up case, we obtain a new characterization of regular tree
languages. In the top-down case, we obtain a subclass of regular tree
languages which contains the class of languages recognized by deterministic
top-down tree automata. RFTA also come with the property of existence
of canonical non-deterministic tree automata.

1 Introduction 

The study of tree automata has a long history in computer science; see the survey
of Thatcher [Tha73], the texts of F. Gécseg and M. Steinby [GS84, GS96],
and that of the TATA group [CDG+97]. With the advent of tree-based metalanguages
(SGML and XML) for document grammars, new tree automata formalisms and
tree automata based algorithms have been developed [MLM01, Nev02].
Also, because of the tree structure of documents, learning algorithms for tree
languages have been defined for the tasks of information extraction and
information retrieval [Fer02, GK02, LPH00]. We are currently involved in a research
project dealing with information extraction systems from semi-structured data.
One objective is the definition of classes of tree automata satisfying two
properties: there are efficient algorithms for membership and matching, and there are
efficient learning algorithms for the corresponding classes of tree languages.

In the present paper, we only consider finite ranked trees. There are bottom-up
(also known as frontier-to-root) tree automata and top-down (also known as
root-to-frontier) tree automata. The top-down version is particularly relevant for
some implementations because important properties such as membership (given a
tree automaton A, decide whether an input tree is accepted by A) can be decided
without loading the whole input tree into memory. There are also
deterministic tree automata and non-deterministic tree automata. Determinism
is important to reach efficiency for membership and other decision problems.
It is known that non-deterministic top-down, non-deterministic bottom-up, and
deterministic bottom-up tree automata are equally expressive and define
regular tree languages. But there is a tradeoff between efficiency and
expressiveness because some regular (and even finite) tree languages are not recognized
by deterministic top-down tree automata. Moreover, the size of a deterministic
bottom-up tree automaton can be exponentially larger than the size of a
non-deterministic one recognizing the same tree language. This drawback can
be dramatic when the purpose is to build tree automata. This is for instance the
case in the problem of tree pattern matching and in machine learning problems
like grammatical inference.

The process of learning finite state machines from data is referred to as
grammatical inference. The first theoretical foundations were given by Gold [Gol67]
and the first applications were designed in the field of pattern recognition.
Grammatical inference has mostly focused on learning string languages, but recent works are
concerned with learning tree languages [Sak90, Fer02, GK02]. In most works, the
target tree language is represented by a deterministic bottom-up tree
automaton. This is problematic because the time complexity of the learning algorithm
depends on the size of the target automaton. Therefore, again, it is crucial to
define learning algorithms for non-deterministic tree automata. The reader should
note that tree patterns [GK02] satisfy this property.

Therefore the aim of this article is to define non-deterministic tree automata
corresponding to sufficiently expressive classes of tree languages and having nice
properties from the algorithmic viewpoint and from the grammatical inference
viewpoint. To this end, we extend previous work from the string case [DLT02a]
to the tree case and we define residual finite tree automata (RFTA). The reader
should note that learning algorithms for residual finite string automata have been
defined [DLT01, DLT02b].

In Section 3, we study the bottom-up case. We define the residual language
of a language L w.r.t. a ground term t as the set of contexts c such that c[t] is
a term in L. We define bottom-up residual tree automata as automata whose
states correspond to residual languages. Bottom-up residual tree automata are
non-deterministic and recognize regular tree languages. We prove that every
regular tree language is recognized by a unique canonical bottom-up residual
tree automaton, minimal in the number of states. We give an example
of regular tree languages for which the size of the deterministic bottom-up tree
automata grows exponentially with respect to the size of the canonical
bottom-up residual tree automata.

In Section 4, we study the top-down case. We define the residual language
of a language L w.r.t. a context c as the set of ground terms t such that c[t]
is a term in L. We define top-down residual tree automata as automata whose
states correspond to residual languages. Top-down residual tree automata are
non-deterministic tree automata. Interestingly, the class of languages recognized
by top-down residual tree automata is strictly included in the class of regular
tree languages and strictly contains the class of languages recognized by
deterministic top-down tree automata. We also prove that every tree language in this
family is recognized by a unique canonical top-down residual tree automaton;
this automaton is minimal in the number of states.

The definition of residual finite tree automata comes with new decision
problems. All of them rely on properties of residual languages. It is proved that
all residual languages of a given tree language L can be built in both the top-down
and bottom-up cases. From these constructions we obtain positive answers to
decision problems like 'decide whether an automaton is a (canonical) RFTA'.
The exact complexity bounds are not given, but we conjecture that they are identical
to those of the string case.

The present work is connected with the paper by Nivat and Podelski [NP97].
To define tree automata, they consider a monoid framework whose elements are
called pointed trees (contexts in our terminology, special trees in [Tho84]).
They define a Nerode congruence in the bottom-up case and in the top-down
case. Their work leads to the generalization of the notion of deterministic to
l-r-deterministic (context-deterministic in our terminology) for top-down tree
automata. They have a minimization procedure for this class of automata. It
should be noted that the class of languages recognized by context-deterministic
tree automata (also called homogeneous tree languages) is strictly included in
the class of languages recognized by residual top-down tree automata.

2 Preliminaries 

We assume that the reader is familiar with basic knowledge about tree automata. 
We follow the notations defined in TATA [CDG+97].

A ranked alphabet is a couple $(\mathcal{F}, Arity)$ where $\mathcal{F}$ is a finite set and $Arity$
is a mapping from $\mathcal{F}$ into $\mathbb{N}$. The set of symbols of arity $p$ is denoted by $\mathcal{F}_p$.
Elements of arity $0, 1, \ldots, p$ are respectively called constants, unary, ..., p-ary
symbols. We assume that $\mathcal{F}$ contains at least one constant. In the examples, we
use parentheses and commas for a short declaration of symbols with arity. For
instance, $a$ is a constant and $f(,)$ is a short declaration for a binary symbol $f$.
The set of terms over $\mathcal{F}$ is denoted by $T(\mathcal{F})$. Let $\diamond$ be a special constant which is
not in $\mathcal{F}$. The set of contexts (also known as pointed trees in [NP97] and special
trees in [Tho84]), denoted by $C(\mathcal{F})$, is the set of terms over $\mathcal{F} \cup \{\diamond\}$ which contain exactly
one occurrence of $\diamond$. The expression $c[\diamond]$ denotes a context; we only write $c$ when
there is no ambiguity. We denote by $c[t]$ the term obtained from $c[\diamond]$ by replacing
$\diamond$ by a term $t$.

A bottom-up Finite Tree Automaton (↑-FTA) over $\mathcal{F}$ is a tuple $A = (Q, \mathcal{F}, Q_f, \Delta)$
where $Q$ is a finite set of states, $Q_f \subseteq Q$ is a set of final states, and $\Delta$ is a
set of transition rules of the form $f(q_1, \ldots, q_n) \to q$ where $n \ge 0$, $f \in \mathcal{F}_n$,
$q, q_1, \ldots, q_n \in Q$. In this paper, the size of an automaton refers to its
number of states, so two automata which have the same number of states but
different numbers of rules are considered as having the same size. When $n = 0$, a
rule is written $a \to q$, where $a$ is a constant. The move relation is written $\to_A$
and $\to_A^*$ is the reflexive and transitive closure of $\to_A$. A term $t$ reaches a state
$q$ if and only if $t \to_A^* q$. A state $q$ accepts a context $c$ if and only if there exists a
$q_f \in Q_f$ such that $c[q] \to_A^* q_f$. The automaton $A$ recognizes a term $t$ if and only
if there exists a $q_f \in Q_f$ such that $t \to_A^* q_f$. The language recognized by $A$ is the
set of all terms recognized by $A$, and is denoted by $L(A)$.

Two ↑-FTA are equivalent if they recognize the same tree language. A
↑-FTA $A = (Q, \mathcal{F}, Q_f, \Delta)$ is trimmed if and only if each of its states can be reached
by at least one term and accepts at least one context. A ↑-FTA is deterministic
(↑-DFTA) if and only if there are no two rules with the same left-hand side in
its set of rules. A tree language is regular if and only if it is recognized by a
bottom-up tree automaton. As any ↑-FTA can be changed into an equivalent
trimmed ↑-DFTA, any regular tree language can be recognized by a trimmed
↑-DFTA.

Let $L$ be a tree language over a ranked alphabet $\mathcal{F}$ and $t$ a term. The
bottom-up residual language of $L$ relative to a term $t$, denoted by $t^{-1}L$, is the set of all
contexts $c$ in $C(\mathcal{F})$ such that $c[t] \in L$:

$$t^{-1}L = \{ c \in C(\mathcal{F}) \mid c[t] \in L \}.$$

Note that a bottom-up residual language is a set of contexts, and not a tree
language. The Myhill-Nerode congruence for tree languages can be defined as follows:
two terms $t$ and $t'$ are equivalent if they define the same residual language.
From the Myhill-Nerode theorem for tree languages, we get the following result:
a tree language is recognizable if and only if the number of its residual languages is
finite.

A top-down finite tree automaton (↓-FTA) over $\mathcal{F}$ is a tuple $A = (Q, \mathcal{F}, I, \Delta)$
where $Q$ is a set of states, $I \subseteq Q$ is a set of initial states, and $\Delta$ is a set of rewrite
rules of the form $q(f) \to f(q_1, \ldots, q_n)$ where $n \ge 0$, $f \in \mathcal{F}_n$, $q, q_1, \ldots, q_n \in Q$.
Again, if $n = 0$ the rule is written $q(a) \to a$. The move relation is written $\to_A$
and $\to_A^*$ is the reflexive and transitive closure of $\to_A$. A state $q$ accepts a term
$t$ if and only if $q(t) \to_A^* t$. $A$ recognizes a term $t$ if and only if at least one of its
initial states accepts it. The language recognized by $A$ is the set of all ground
terms recognized by $A$ and is denoted by $L(A)$.

Any regular tree language can be recognized by a ↓-FTA. This means that
↑-FTA and ↓-FTA have the same expressive power. A ↓-FTA is deterministic
(↓-DFTA) if and only if its set of rules does not contain two rules with the same
left-hand side. Unlike ↑-DFTA, ↓-DFTA are not able to recognize all regular tree
languages.

Let $L$ be a tree language over a ranked alphabet $\mathcal{F}$, and $c$ a context of $C(\mathcal{F})$.
The top-down residual language of $L$ relative to $c$, denoted by $c^{-1}L$, is the set
of ground terms $t$ such that $c[t] \in L$:

$$c^{-1}L = \{ t \in T(\mathcal{F}) \mid c[t] \in L \}.$$

The definition of top-down residual languages comes with an equivalence
relation on contexts. It is worth noting that it does not define a congruence over
terms. Nonetheless, based on [NP97], it can be shown that a tree language $L$ is
regular if and only if the number of top-down residual languages associated with
$L$ is finite. The proof uses the fact that the number of top-down residual languages
is lower than the number of bottom-up residual languages.

3 Bottom-up residual finite tree automata 

In this section, we introduce a new class of bottom-up finite tree automata,
called bottom-up residual finite tree automata (↑-RFTA). This class of automata
shares some interesting properties with both bottom-up deterministic and
non-deterministic finite tree automata, which both recognize the class of regular tree
languages.

On the one hand, like ↑-DFTA, ↑-RFTA admit a unique canonical form, based
on a correspondence between states and residual languages, whereas ↑-FTA do
not. On the other hand, ↑-RFTA are non-deterministic and can be much smaller
in their canonical form than their deterministic counterparts.

3.1 Definition and expressive power of bottom-up residual finite 
tree automata 

First, let us make precise the nature of this correspondence, then let us give the formal
definition of ↑-residual tree automata and describe their properties.

In order to establish the nature of this correspondence between states and
residual languages, let us introduce the notion of state languages. The state
language $C_q$ of a state $q$ is the set of contexts accepted by the state $q$:

$$C_q = \{ c \in C(\mathcal{F}) \mid \exists q_f \in Q_f,\ c[q] \to_A^* q_f \}.$$

As shown by the following example, state languages are generally not residual 
languages: 

Example 1. Consider the tree language $L = \{f(a_1, b_1), f(a_1, b_2), f(a_2, b_2)\}$ over
$\mathcal{F} = \{f(,), a_1, b_1, a_2, b_2\}$. This language $L$ is recognized by the tree
automaton $A = (\{q_1, q_2, q_3, q_4, q_5\}, \mathcal{F}, \{q_5\}, \Delta)$ where $\Delta = \{a_1 \to q_1, b_1 \to q_2, b_2 \to q_3,
a_2 \to q_4, a_1 \to q_4, f(q_1, q_2) \to q_5, f(q_4, q_3) \to q_5\}$. The residual languages of $L$
are $a_1^{-1}L = \{f(\diamond, b_1), f(\diamond, b_2)\}$, $b_1^{-1}L = \{f(a_1, \diamond)\}$, $b_2^{-1}L = \{f(a_1, \diamond), f(a_2, \diamond)\}$,
$a_2^{-1}L = \{f(\diamond, b_2)\}$, $f(a_1, b_1)^{-1}L = \{\diamond\}$. The state language of $q_1$ is $\{f(\diamond, b_1)\}$,
which is not a residual language. The tree $a_1$ reaches $q_1$, so each context
accepted by $q_1$ is an element of the residual language $a_1^{-1}L$, which means that
$C_{q_1} \subseteq a_1^{-1}L$. But the reverse inclusion is not true because $f(\diamond, b_2)$ is not an
element of $C_{q_1}$. The reader should note that this situation is possible because $A$ is
non-deterministic.
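
Example 1 is small enough to be checked mechanically. The following Python sketch is an illustration only: the tuple encoding of terms, the `HOLE` marker and the helper names `reach`, `splits`, `residual` and `state_language` are assumptions of this sketch, not notation from the paper. It also assumes that the automaton is trimmed, so that every accepted context can be found among the contexts of terms of $L$.

```python
from itertools import product

HOLE = "<>"  # stands for the special constant of a context

# Terms are nested tuples ("f", left, right); constants are plain strings.
L = {("f", "a1", "b1"), ("f", "a1", "b2"), ("f", "a2", "b2")}

# Transition rules of A: (symbol, child states) -> set of target states.
RULES = {
    ("a1", ()): {"q1", "q4"},
    ("b1", ()): {"q2"},
    ("b2", ()): {"q3"},
    ("a2", ()): {"q4"},
    ("f", ("q1", "q2")): {"q5"},
    ("f", ("q4", "q3")): {"q5"},
}
FINAL = {"q5"}


def reach(term, hole_state=None):
    """States reached bottom-up by `term`; a hole is read as `hole_state`."""
    if term == HOLE:
        return {hole_state}
    symbol, children = (term, ()) if isinstance(term, str) else (term[0], term[1:])
    child_states = [reach(child, hole_state) for child in children]
    reached = set()
    for combo in product(*child_states):
        reached |= RULES.get((symbol, combo), set())
    return reached


def splits(term):
    """Yield every (context, subterm) pair such that context[subterm] == term."""
    yield HOLE, term
    if not isinstance(term, str):
        for i in range(1, len(term)):
            for ctx, sub in splits(term[i]):
                yield term[:i] + (ctx,) + term[i + 1:], sub


def residual(t):
    """Bottom-up residual language of L relative to the term t."""
    return {ctx for term in L for ctx, sub in splits(term) if sub == t}


def state_language(q):
    """Contexts accepted by state q, searched among the contexts of terms of L."""
    candidates = {ctx for term in L for ctx, _ in splits(term)}
    return {c for c in candidates if reach(c, q) & FINAL}
```

With these definitions, `state_language("q1")` is a strict subset of `residual("a1")`, while the union of the state languages of the two states reached by `a1` gives back `residual("a1")`.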

In fact, it can be proved (the proof is omitted) that residual languages are
unions of state languages. For any $L$ recognized by a tree automaton $A$, we have

$$\forall t \in T(\mathcal{F}),\quad t^{-1}L = \bigcup_{q \in Q,\ t \to_A^* q} C_q. \qquad (1)$$

As a consequence, if $A$ is deterministic and trimmed, each residual language
is a state language and conversely.

We can define a new class of non-deterministic automata by requiring that each
state language correspond to a residual tree language. We have seen that
residual tree languages are related to the Myhill-Nerode congruence, and we will
show that minimization of tree automata can be extended to the definition of a
canonical form for this class of non-deterministic tree automata.

Definition 1. A bottom-up residual tree automaton (↑-RFTA) is a ↑-FTA $A =
(Q, \mathcal{F}, Q_f, \Delta)$ such that $\forall q \in Q, \exists t \in T(\mathcal{F}), C_q = t^{-1}L(A)$.

According to the above definition and previous remarks, it can be shown that
every trimmed ↑-DFTA is a ↑-RFTA. As a consequence, ↑-RFTA have the same
expressive power as finite tree automata:

Theorem 1. The class of tree languages recognized by ↑-RFTA is the class of
regular tree languages.

As an advantage of ↑-RFTA, the number of states of a ↑-RFTA can be much
smaller than the number of states of any equivalent ↑-DFTA:

Proposition 1. There exists a sequence $(L_n)$ of regular tree languages such
that for each $L_n$, the size of the smallest ↑-DFTA which recognizes $L_n$ is an
exponential function of $n$, and the size of the smallest ↑-RFTA which recognizes
$L_n$ is a linear function of $n$.

Sketch of proof. We give an example of regular tree languages for which the size
of the ↑-DFTA grows exponentially with respect to the size of the equivalent
canonical ↑-RFTA. A path is a sequence of symbols from the root to a leaf of
a tree. The length of a path is the number of symbols on the path, except the
root. Let $\mathcal{F} = \{f(,), a\}$ and let us consider the tree language $L_n$ which contains
exactly the trees with at least one path of length $n$. Let $A_n = (Q, \mathcal{F}, Q_f, \Delta)$ be
a ↑-FTA defined by: $Q = \{q_*, q_0, \ldots, q_n\}$, $Q_f = \{q_0\}$ and

$$\Delta = \{a \to q_*,\ a \to q_n,\ f(q_*, q_*) \to q_*\} \cup \bigcup_{k \in [1, \ldots, n],\ q \in Q \setminus \{q_0\}} \{f(q_k, q) \to q_{k-1},\ f(q, q_k) \to q_{k-1},\ f(q_k, q) \to q_*,\ f(q, q_k) \to q_*\}.$$

Let $C_*$ be the set of contexts which contain at least one path of length $n$.
Let $C_i$ be the set of contexts whose path from the root to $\diamond$ is of length $i$. Let $t_*$
be a term such that all its paths are of length greater than $n$. Note that the set
of contexts $c$ such that $c[t_*]$ belongs to $L_n$ is exactly the set of contexts $C_*$. Let
$t_0, \ldots, t_n$ be terms such that for all $i \le n$, $t_i$ contains exactly one path of length
at most $n$, and the length of this path is $n - i$. Therefore, $t_i^{-1}L_n$ is the set
of contexts $C_* \cup C_i$.

One can verify that $C_{q_*}$ is exactly $t_*^{-1}L_n = C_*$, and for all $i \le n$, $C_{q_i}$ is exactly
$t_i^{-1}L_n = C_* \cup C_i$. The reader should note that rules of the form $f(q_k, q) \to q_*$
and $f(q, q_k) \to q_*$ are not useful to recognize $L_n$ but they are required to obtain
a ↑-RFTA (because $C_i$ is not a residual language of $L_n$). So $A_n$ is a ↑-RFTA
and recognizes $L_n$. The size of $A_n$ is $n + 2$.

The construction of the smallest ↑-DFTA which recognizes $L(A_n)$ is left to
the reader. But it can easily be shown that the number of states is in $O(2^n)$
because states must store the lengths of all paths smaller than $n$. □
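
The claim that $A_n$ recognizes $L_n$ can be tested on small trees. The Python sketch below is illustrative only: the state names `"q*"`, `"q0"`, ..., the tuple encoding of terms and all function names are assumptions of this sketch. It builds the rules of $A_n$ as given in the proof and compares acceptance with a direct computation of path lengths on every tree of bounded height.

```python
from itertools import product


def build_rules(n):
    """Rules of A_n as a dict (symbol, child states) -> set of target states."""
    states = ["q*"] + [f"q{k}" for k in range(n + 1)]
    rules = {("a", ()): {"q*", f"q{n}"}, ("f", ("q*", "q*")): {"q*"}}
    for k in range(1, n + 1):
        for q in states:
            if q == "q0":
                continue  # q ranges over Q \ {q0}
            for pair in ((f"q{k}", q), (q, f"q{k}")):
                rules.setdefault(("f", pair), set()).update({f"q{k - 1}", "q*"})
    return rules


def reach(term, rules):
    """States reached bottom-up by a term; "a" is the leaf, ("f", l, r) a node."""
    if term == "a":
        return rules[("a", ())]
    left, right = reach(term[1], rules), reach(term[2], rules)
    reached = set()
    for pair in product(left, right):
        reached |= rules.get(("f", pair), set())
    return reached


def accepts(term, n):
    """True if A_n recognizes the term, i.e. the term reaches the final state q0."""
    return "q0" in reach(term, build_rules(n))


def path_lengths(term):
    """Lengths of all root-to-leaf paths, counted without the root symbol."""
    if term == "a":
        return {0}
    return {p + 1 for p in path_lengths(term[1]) | path_lengths(term[2])}


def trees(height):
    """All terms over {f(,), a} of height at most `height`."""
    if height == 0:
        return ["a"]
    smaller = trees(height - 1)
    return ["a"] + [("f", left, right) for left in smaller for right in smaller]
```

For instance, `all(accepts(t, 2) == (2 in path_lengths(t)) for t in trees(3))` compares the automaton with the definition of $L_2$ on the 26 trees of height at most 3. This is a finite sanity check, not a proof.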

Unfortunately, the size of a ↑-RFTA can be exponentially larger than the
size of an equivalent ↑-FTA.

3.2 The canonical form of bottom-up residual tree automata 

Like ↑-DFTA, ↑-RFTA have the interesting property of admitting a canonical form. In
the case of ↑-DFTA, there is a one-to-one correspondence between residual
languages and state languages. This is a consequence of the Myhill-Nerode theorem
for trees.

A similar result holds for ↑-RFTA. In a canonical ↑-RFTA, the set of states
is in one-to-one correspondence with a subset of residual languages called prime
residual languages.

Definition 2. Let $L$ be a tree language. A bottom-up residual language of $L$ is
composite if and only if it is the union of the bottom-up residual languages that
it strictly contains:

$$t^{-1}L = \bigcup_{t'^{-1}L \subsetneq t^{-1}L} t'^{-1}L.$$

A residual language is prime if and only if it is not composite.
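
When the residual languages are finite sets, Definition 2 can be tested directly. The sketch below is an illustration under assumptions: the function names are hypothetical, residual languages are encoded as `frozenset` objects, and contexts are written as plain strings with `<>` for the special constant. `RESIDUALS` lists the non-empty residual languages of the language $L$ of Example 1, and `TOY` is an invented family that contains a composite member.

```python
def is_composite(language, family):
    """True if `language` is the union of the members of `family` it strictly contains."""
    strictly_smaller = [other for other in family if other < language]
    return frozenset().union(*strictly_smaller) == language


def prime_members(family):
    """Members of `family` that are not composite."""
    return {language for language in family if not is_composite(language, family)}


# Non-empty bottom-up residual languages of L from Example 1.
RESIDUALS = {
    frozenset({"f(<>,b1)", "f(<>,b2)"}),  # residual relative to a1
    frozenset({"f(<>,b2)"}),              # residual relative to a2
    frozenset({"f(a1,<>)"}),              # residual relative to b1
    frozenset({"f(a1,<>)", "f(a2,<>)"}),  # residual relative to b2
    frozenset({"<>"}),                    # residual relative to f(a1, b1)
}

# Invented family: {1, 2} is the union of the two sets it strictly contains.
TOY = {frozenset({1}), frozenset({2}), frozenset({1, 2})}
```

Here every member of `RESIDUALS` is prime: the residual relative to `a1` strictly contains the one relative to `a2`, but is not covered by it. In `TOY`, the set `{1, 2}` is composite. Note that under this literal reading an empty language is composite, since it equals the empty union.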

Example 2. Let us consider again the tree languages in the proof of
Proposition 1. Let $Q_n$ be the set of states of $A_n$. All the $n + 2$ states $q_*, q_0, \ldots, q_n$ of $Q_n$
have state languages which are prime residual languages. The subset
construction applied to $A_n$ to build a ↑-DFTA $D_n$ leads us to consider states which are
subsets of $Q_n$. The state language of a state $\{q_{k_1}, \ldots, q_{k_n}\}$ is a composite residual
language. It is the union of $t_{k_1}^{-1}L, \ldots, t_{k_n}^{-1}L$.

In canonical ↑-RFTA, all state languages are prime residual languages.

Theorem 2. Let $L$ be a regular tree language and let us consider the ↑-FTA
$A_{can} = (Q, \mathcal{F}, Q_f, \Delta)$ defined by:

— $Q$ is in bijection with the set of all prime bottom-up residual languages of $L$.
We denote by $t_q$ a ground term such that $q$ is associated with $t_q^{-1}L$ in this
bijection.

— $Q_f$ is the set of all elements $q$ of $Q$ such that $t_q^{-1}L$ contains the void context
$\diamond$.

— $\Delta$ contains all the rules $f(q_1, \ldots, q_n) \to q$ such that $t_q^{-1}L \subseteq (f(t_{q_1}, \ldots, t_{q_n}))^{-1}L$
and all the rules $a \to q$ such that $a \in \mathcal{F}_0$ and $t_q^{-1}L \subseteq a^{-1}L$.

$A_{can}$ is a ↑-RFTA, it is the smallest ↑-RFTA in number of states which recognizes
$L$, and it is unique up to a renaming of its states.

Sketch of proof. There are three things to prove in this theorem: the canonical
↑-RFTA $A_{can} = (Q, \mathcal{F}, Q_f, \Delta)$ of a regular tree language $L$ recognizes $L$, it is a
↑-RFTA, and there cannot be any strictly smaller ↑-RFTA which recognizes $L$.
The three points are proved in this order.

We first have to prove the equality $L(A_{can}) = L$. It follows from the identity

$$(*)\quad \forall t,\ t^{-1}L = \bigcup_{q \in Q,\ t \to_{A_{can}}^* q} t_q^{-1}L,$$

which can be proved inductively on the height of $t$. Using this property, we have:

$$t \in L \iff \diamond \in t^{-1}L \iff \diamond \in \bigcup_{q \in Q,\ t \to_{A_{can}}^* q} t_q^{-1}L \iff \exists q_f \in Q_f,\ t \to_{A_{can}}^* q_f \iff t \in L(A_{can}).$$

The equality between $L$ and $L(A_{can})$ helps us to prove the characterization
of ↑-RFTA: $t_q^{-1}L = C_q^{can}$, where $C_q^{can}$ is the state language of $q$ in $A_{can}$.

The last point can be proved as follows. In a ↑-RFTA, any residual
language is a union of state languages, and any state language is a residual
language. So any prime residual language is a state language, so there are at
least as many states in a ↑-RFTA as prime residual languages admitted by its
corresponding tree language.

□

The canonical automaton is uniquely determined by the tree language
under consideration, but there may be other automata which have the same
number of states. The canonical ↑-RFTA is unique because it has the maximum
number of rules. Even though all its states are associated with prime residual languages,
the automaton considered in the proof of Proposition 1 is not the canonical one
because some rules are missing: $\bigcup_{k=1}^{n} \{f(q_k, q_0) \to q_{k-1},\ f(q_0, q_k) \to q_{k-1}\}$ and
$\bigcup_{q \in Q} \{f(q, q_0) \to q_*,\ f(q_0, q) \to q_*\}$.

4 Top-Down residual finite tree automata 

The definition of top-down residual finite tree automata (↓-RFTA) is tightly
correlated with the definition of ↑-RFTA. Similarly to ↑-RFTA, ↓-RFTA are
defined as non-deterministic tree automata where each state language is a residual
language. Any ↓-RFTA can be transformed into an equivalent canonical ↓-RFTA,
minimal in the number of states and unique up to state renaming.

The main difference between the bottom-up and the top-down case lies in the
expressive power of tree automata. The three classes of bottom-up
tree automata, ↑-DFTA, ↑-RFTA and ↑-FTA, have the same expressive power. In
the top-down case, deterministic, residual and non-deterministic tree automata
have different expressive power. This makes the canonical form of ↓-RFTA more
interesting. Compared to the minimal form of ↓-DFTA, it can be smaller when
both exist, and it exists for a wider class of tree languages.

Let us introduce ↓-RFTA through their similarity with ↑-RFTA, then study
this specific problem of expressiveness.

4.1 Analogy with bottom-up residual tree automata 

Let us formally define state languages in the top-down case: 

Definition 3. Let $L$ be a regular tree language over a ranked alphabet $\mathcal{F}$, let $A$
be a top-down tree automaton which recognizes $L$, and let $q$ be a state of this
automaton. The state language of $L$ relative to $q$, written $L_q$, is the set of terms
which are accepted by $q$:

$$L_q = \{ t \in T(\mathcal{F}) \mid q(t) \to_A^* t \}.$$

Some properties similar to those already studied in the previous section
follow from this definition. Firstly, state languages are generally not residual
languages. Secondly, residual languages are unions of state languages. Let us
define $Q_c$:

$$Q_c = \{ q \mid q \in Q,\ \exists q_i \in I,\ q_i(c[\diamond]) \to_A^* c[q(\diamond)] \}.$$

We have the following relation between state languages and residual
languages.

Lemma 1. Let $L$ be a tree language and let $A = (Q, \mathcal{F}, I, \Delta)$ be a top-down tree
automaton which recognizes $L$. Then $\forall c \in C(\mathcal{F}),\ \bigcup_{q \in Q_c} L_q = c^{-1}L$.

These similarities lead us to this definition of top-down residual tree
automata:

Definition 4. A top-down Residual Finite Tree Automaton (↓-RFTA)
recognizing a tree language $L$ is a ↓-FTA $A = (Q, \mathcal{F}, I, \Delta)$ such that: $\forall q \in Q, \exists c \in C(\mathcal{F}),
L_q = c^{-1}L$.

The languages defined in the proof of Proposition 1 are still interesting here to
define examples of top-down residual tree automata:

Example 3. Let us consider again the family of tree languages $L_n$, and the family
of corresponding ↑-RFTA $A_n$. For every $n$, let $A'_n$ be the ↓-RFTA defined by:
$Q = \{q_*, q_0, \ldots, q_n\}$, $I = \{q_0\}$ and

$$\Delta = \{q_*(a) \to a,\ q_n(a) \to a,\ q_*(f) \to f(q_*, q_*)\} \cup \bigcup_{k=1}^{n} \{q_{k-1}(f) \to f(q_k, q_*),\ q_{k-1}(f) \to f(q_*, q_k)\}.$$

For every $k \le n$, the state language of $q_k$ is equal to $L_{n-k}$. And $L_{n-k}$ is
the top-down residual language of $L_n$ relative to $c_k$, where $c_k$ is a context whose height from
the root to the special constant $\diamond$ is $k$ and which does not contain any path whose
length is smaller than or equal to $n$. The state language of $q_*$ is $T(\mathcal{F})$. And $T(\mathcal{F})$ is
the top-down residual language of $L_n$ relative to $c_*$, where $c_*$ is a context which
contains a path whose length is $n$. So $A'_n$ is a ↓-RFTA. Moreover, it is easy to
verify that $A'_n$ recognizes $L_n$.
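
The two claims of Example 3 that concern state languages can be tested on small trees. The following Python sketch is an illustration only: the encoding of rules as a map from a state to its right-hand sides, the state names and all function names are assumptions of this sketch. It runs $A'_n$ top-down and compares the state language of each $q_k$ with $L_{n-k}$ on all trees of bounded height.

```python
def build_rules(n):
    """Rules of A'_n: state -> list of right-hand sides ("a" or a pair of states)."""
    rules = {"q*": ["a", ("q*", "q*")], f"q{n}": ["a"]}
    for k in range(1, n + 1):
        # q_{k-1}(f) -> f(q_k, q*) and q_{k-1}(f) -> f(q*, q_k)
        rules.setdefault(f"q{k - 1}", []).extend([(f"q{k}", "q*"), ("q*", f"q{k}")])
    return rules


def accepts(state, term, rules):
    """True if `state` accepts `term`; "a" is the leaf, ("f", l, r) a node."""
    for rhs in rules.get(state, []):
        if term == "a":
            if rhs == "a":
                return True
        elif rhs != "a":
            if accepts(rhs[0], term[1], rules) and accepts(rhs[1], term[2], rules):
                return True
    return False


def path_lengths(term):
    """Lengths of all root-to-leaf paths, counted without the root symbol."""
    if term == "a":
        return {0}
    return {p + 1 for p in path_lengths(term[1]) | path_lengths(term[2])}


def trees(height):
    """All terms over {f(,), a} of height at most `height`."""
    if height == 0:
        return ["a"]
    smaller = trees(height - 1)
    return ["a"] + [("f", left, right) for left in smaller for right in smaller]
```

A call such as `accepts("q0", t, build_rules(2))` decides whether $A'_2$ recognizes `t`, since $q_0$ is the only initial state. The check is a finite experiment on small trees and does not replace the argument of the example.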

4.2 The expressive power of top-down tree automata 

Top-down deterministic automata and path-closed languages. A tree language $L$
is path-closed if:

$$\forall c \in C(\mathcal{F}),\quad c[f(t_1, t_2)] \in L \wedge c[f(t'_1, t'_2)] \in L \Rightarrow c[f(t_1, t'_2)] \in L.$$

The reader should note that the definition only considers binary symbols;
the definition can easily be extended to n-ary symbols. The class of languages
that ↓-DFTA can recognize is the class of path-closed languages [Vir81].

Context-deterministic automata and homogeneous languages. Podelski and
Nivat [NP97] have defined l-r-deterministic top-down tree automata. In the
present paper, let us call them top-down context-deterministic tree automata.

Definition 5. A top-down context-deterministic tree automaton (↓-CFTA) $A$ is
a ↓-FTA such that for every context $c \in C(\mathcal{F})$, $Q_c$ is either the empty set or a
singleton set.

A homogeneous language is a tree language $L$ satisfying:

$$\forall c \in C(\mathcal{F}),\quad c[f(t_1, t_2)] \in L \wedge c[f(t_1, t'_2)] \in L \wedge c[f(t'_1, t_2)] \in L \Rightarrow c[f(t'_1, t'_2)] \in L.$$

Again, the definition can easily be extended from the binary case to n-ary
symbols. Podelski and Nivat have shown that the class of languages recognized by ↓-CFTA is
the class of homogeneous languages.
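
For a finite language, both closure conditions can be checked by brute force. The sketch below is illustrative and rests on assumptions: the tuple encoding of terms, the `HOLE` marker and all function names are hypothetical, and only binary symbols are handled, as in the two definitions above. For every context $c$ and binary symbol $f$, it collects the pairs $(t_1, t_2)$ such that $c[f(t_1, t_2)] \in L$ and tests the closure condition on that set of pairs.

```python
from itertools import product

HOLE = "<>"  # stands for the special constant of a context


def splits(term):
    """Yield every (context, subterm) pair such that context[subterm] == term."""
    yield HOLE, term
    if not isinstance(term, str):
        for i in range(1, len(term)):
            for ctx, sub in splits(term[i]):
                yield term[:i] + (ctx,) + term[i + 1:], sub


def argument_pairs(language):
    """Map (context c, binary symbol f) to the pairs (t1, t2) with c[f(t1, t2)] in L."""
    pairs = {}
    for term in language:
        for ctx, sub in splits(term):
            if not isinstance(sub, str) and len(sub) == 3:
                pairs.setdefault((ctx, sub[0]), set()).add((sub[1], sub[2]))
    return pairs


def is_path_closed(language):
    """Each set of argument pairs must be a full Cartesian product."""
    for pairs in argument_pairs(language).values():
        firsts = {p[0] for p in pairs}
        seconds = {p[1] for p in pairs}
        if pairs != set(product(firsts, seconds)):
            return False
    return True


def is_homogeneous(language):
    """(t1,t2), (t1,t2') and (t1',t2) in the pairs must force (t1',t2')."""
    for pairs in argument_pairs(language).values():
        for (t1, t2), (s1, t2p), (t1p, s2) in product(pairs, repeat=3):
            if s1 == t1 and s2 == t2 and (t1p, t2p) not in pairs:
                return False
    return True
```

For example, the language `{("f", "a", "b"), ("f", "b", "a")}` is reported as homogeneous but not path-closed, because path-closure would also require `f(a, a)`.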
The hierarchy. A ↓-DFTA is a ↓-CFTA. For ↓-CFTA and ↓-RFTA, we have the
following result:

Lemma 2. Any trimmed ↓-CFTA is a ↓-RFTA.

Proof. Let $A = (Q, \mathcal{F}, I, \Delta)$ be a trimmed ↓-CFTA recognizing a tree language
$L$. As $A$ is trimmed, all states are reachable, so for every $q$, there exists a $c$ such
that $q \in Q_c$. Then, by definition of a ↓-CFTA, for every $q$, there exists a $c$ such
that $\{q\} = Q_c$. Using Lemma 1, we have:

$$\forall q \in Q,\ \exists c \in C(\mathcal{F}),\ L_q = c^{-1}L,$$

stating that $A$ is a ↓-RFTA. □

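Therefore, if we denote by $\mathcal{L}_C$ the class of tree languages recognized by a
class of automata $C$, we obtain the following hierarchy:

$$\mathcal{L}_{\downarrow\text{-DFTA}} \subseteq \mathcal{L}_{\downarrow\text{-CFTA}} \subseteq \mathcal{L}_{\downarrow\text{-RFTA}} \subseteq \mathcal{L}_{\downarrow\text{-FTA}}$$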

The hierarchy is strict.

— Let $L_1 = \{f(a, b), f(b, a)\}$. $L_1$ is homogeneous but not path-closed. Therefore
$L_1$ can be recognized by a ↓-CFTA, but cannot be recognized by a ↓-DFTA.

— The tree languages $L_n$ in the proof of Proposition 1 are not recognized by
↓-CFTA. We can easily verify that $L_n$ is not homogeneous. Indeed, if $t$ is a
term which has a path whose length is equal to $n - 1$, and $t'$ a term which
does not have any path whose length is smaller than $n$, then $f(t, t)$, $f(t, t')$, $f(t', t)$
belong to $L_n$, but $f(t', t')$ does not. And we have already shown that $L_n$ is
recognized by a ↓-RFTA.

— Let $L' = \{f(a, b), f(a, c), f(b, a), f(b, c), f(c, a), f(c, b)\}$. $L'$ is a finite
language, therefore it is a regular tree language which can be recognized by a
↓-FTA. $L'$ cannot be recognized by a ↓-RFTA. To prove that, let us consider
a ↓-FTA $A'$ which recognizes $L'$. The top-down residual languages of $L'$ are
$\{a, b\}$, $\{a, c\}$, $\{b, c\}$ and $L'$. As $A'$ recognizes $L'$, it recognizes $f(a, b)$. This
implies the existence of three states $q_1, q_2, q_3$ and three rules $q_1(f) \to f(q_2, q_3)$,
$q_2(a) \to a$, and $q_3(b) \to b$. If $A'$ were a ↓-RFTA, then $q_2$ would accept a
residual language. As $q_2$ accepts $a$, it would accept either $\{a, b\}$ or $\{a, c\}$.
Similarly, $q_3$ would accept either $\{a, b\}$ or $\{b, c\}$. In these conditions, and
thanks to the rule $q_1(f) \to f(q_2, q_3)$, $A'$ would recognize $f(a, a)$, $f(b, b)$ or
$f(c, c)$. So $A'$ cannot be a ↓-RFTA.
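
The two facts used in the last item can be checked by enumeration. The sketch below is an illustration under assumptions: terms are encoded as tuples, `HOLE` marks the special constant, and the function names are hypothetical. It lists the non-empty top-down residual languages of the six-term language of the last item, then checks that any choice of a residual containing `a` for the left state and a residual containing `b` for the right state shares a constant `x`, so that `f(x, x)` would be accepted.

```python
HOLE = "<>"  # stands for the special constant of a context


def splits(term):
    """Yield every (context, subterm) pair such that context[subterm] == term."""
    yield HOLE, term
    if not isinstance(term, str):
        for i in range(1, len(term)):
            for ctx, sub in splits(term[i]):
                yield term[:i] + (ctx,) + term[i + 1:], sub


def top_down_residuals(language):
    """Map each context c with a non-empty residual to the set {t | c[t] in L}."""
    residuals = {}
    for term in language:
        for ctx, sub in splits(term):
            residuals.setdefault(ctx, set()).add(sub)
    return residuals


def forces_diagonal(distinct):
    """True if every admissible pair of residuals shares a constant x."""
    lefts = [r for r in distinct if "a" in r]   # candidates for the state accepting a
    rights = [r for r in distinct if "b" in r]  # candidates for the state accepting b
    return all(any(x in right for x in left) for left in lefts for right in rights)


L_PRIME = {("f", x, y) for x in "abc" for y in "abc" if x != y}
DISTINCT = {frozenset(r) for r in top_down_residuals(L_PRIME).values()}
```

`DISTINCT` then contains the four languages named in the text, and `forces_diagonal(DISTINCT)` returns `True`, which is the contradiction used in the argument since no term `f(x, x)` belongs to the language.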

Therefore, we obtain the following result:

Theorem 3. $\mathcal{L}_{\downarrow\text{-DFTA}} \subsetneq \mathcal{L}_{\downarrow\text{-CFTA}} \subsetneq \mathcal{L}_{\downarrow\text{-RFTA}} \subsetneq \mathcal{L}_{\downarrow\text{-FTA}}$

So top-down residual tree automata are strictly more expressive than
context-deterministic tree automata. But as far as we know, there is no straightforward
characterization of the tree languages recognized by ↓-RFTA.

4.3 The canonical form of top-down residual tree automata

The problem of the canonical form of top-down tree automata is similar to the 
bottom-up case. Whereas there is no way to reduce a non-deterministic top-down 
tree automaton to a unique canonical form, a top-down residual tree automaton 
can take such a form. Its definition is similar to the definition of the canonical 
bottom-up tree automaton. 

In the same way that we have defined composite bottom-up residual languages,
a top-down residual language of $L$ is composite if and only if it is the union of
the top-down residual languages that it strictly contains, and a residual language
is prime if and only if it is not composite.

Theorem 4. Let $L$ be a tree language in the class $\mathcal{L}_{\downarrow\text{-RFTA}}$. Let us consider
the ↓-RFTA $A_{can} = (Q, \mathcal{F}, I, \Delta)$ defined by:

— $Q$ is a set of states in bijection with the prime residual languages of $L$. For
each of these residual languages, there exists a $c_q$ such that $q$ is associated
with $c_q^{-1}L$ in this bijection.

— $I$ is the set of prime residuals which are subsets of $L$.

— $\Delta$ contains all the rules $q(a) \to a$ such that $a$ is a constant and $c_q[a] \in L$, and
all the rules $q(f) \to f(q_1, \ldots, q_n)$ such that for all $t_1, \ldots, t_n$ where $t_i \in c_{q_i}^{-1}L$,
$c_q[f(t_1, \ldots, t_n)] \in L$.

$A_{can}$ is a ↓-RFTA, it is the smallest ↓-RFTA in number of states which
recognizes $L$, and it is unique up to a renaming of its states.

Sketch of proof.

The proof is mainly based on this lemma: $t \in c_q^{-1}L \iff t \in L_q^{can}$,
where $L_q^{can}$ is the state language of $q$ in the automaton $A_{can}$.

This lemma is proved by induction on the height of $t$. This is not a
straightforward induction. It involves the rules of a ↓-RFTA $A'$ which recognizes
$L$. Its existence is granted by the hypothesis of the theorem.

Once this is proved, it can be easily deduced that $A_{can}$ recognizes $L$ and is a
↓-RFTA. As there is one state per prime residual in $A_{can}$, it is minimal in number
of states.

□ 

5 Decidability issues 

Some decision problems naturally arise with the definition of RFTA. Most of
these problems are solved just by noting that one can build all residual languages
of a given regular language $L$ defined by a non-deterministic tree automaton. In
the bottom-up case, the state languages of the minimal ↑-DFTA which recognizes
$L$ are exactly the residual languages of $L$, and this automaton can be built with
the subset construction. In the top-down case, the subset construction does not
necessarily give us an automaton which recognizes exactly $L$, but it gives us
the set of all residual languages. Therefore, knowing whether a tree automaton
is an RFTA, whether a residual language is prime or composite, and whether a
tree automaton is a canonical RFTA are decidable. These problems have not
been deeply studied in terms of complexity, but they are at least as hard as the
similar problems with strings, that is, they are PSPACE-hard ([DLT02a]).

6 Conclusion

We have defined new classes of non-deterministic tree automata. In the
bottom-up case, we get another characterization of regular tree languages. More
interestingly, in the top-down case, we obtain a subclass of the regular tree languages.
For both cases, we have a canonical form, and the size of residual tree automata
can be much smaller than that of equivalent deterministic ones (when they exist).

We are currently extending these results to the case of unranked trees because
our application domain is concerned with HTML and XML documents. Also, we
are designing learning algorithms for residual finite tree automata, extending
previous algorithms for residual finite string automata [DLT01, DLT02b].

A Appendix

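A.1 Proof of Equation

Let $L$ be a tree language and $(Q, \mathcal{F}, Q_f, \Delta)$ a ↑-FTA which recognizes it. We
show that $\forall t \in T(\mathcal{F}),\ t^{-1}L = \bigcup_{t \to^* q} C_q$.

Let $t \in T(\mathcal{F})$ and $c \in t^{-1}L$. Then $c[t] \in L$, so there exist $q_f \in Q_f$ and $q \in Q$
such that $c[t] \to^*_\Delta c[q] \to^*_\Delta q_f$, where $t \to^*_\Delta q$ and $c \in C_q$. So
$c \in \bigcup_{t \to^* q} C_q$. So $t^{-1}L \subseteq \bigcup_{t \to^* q} C_q$.

Let $t \in T(\mathcal{F})$ and $c \in \bigcup_{t \to^* q} C_q$. There exists a $q \in Q$ such that
$t \to^*_\Delta q$ and $c \in C_q$. So there exists $q_f \in Q_f$ such that
$c[t] \to^*_\Delta c[q] \to^*_\Delta q_f$. So $c \in t^{-1}L$. So
$\bigcup_{t \to^* q} C_q \subseteq t^{-1}L$.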
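
The equality proved above can be checked mechanically on small examples. The
following Python sketch does so under assumptions that are not part of the
paper: trees are nested tuples, the hole of a context is the hypothetical
marker `HOLE`, rules map `(symbol, child_states)` to sets of states, and all
function names are illustrative. It compares the test "$c[t] \in L$" with the
test "some state $q$ reached by $t$ satisfies $c \in C_q$".

```python
from itertools import product

HOLE = ("<hole>",)  # hypothetical marker for the hole of a context


def reach(tree, rules, hole_states=frozenset()):
    """States q with tree ->* q; a hole evaluates to `hole_states`."""
    if tree == HOLE:
        return set(hole_states)
    sym, *kids = tree
    kid_states = [reach(k, rules, hole_states) for k in kids]
    out = set()
    for choice in product(*kid_states):
        out |= rules.get((sym, choice), set())
    return out


def plug(context, tree):
    """Return c[t]: the context with its hole replaced by `tree`."""
    if context == HOLE:
        return tree
    sym, *kids = context
    return (sym, *[plug(k, tree) for k in kids])


def in_residual(context, tree, rules, finals):
    """c is in t^{-1}L iff c[t] is in L."""
    return bool(reach(plug(context, tree), rules) & finals)


def in_state_contexts(context, q, rules, finals):
    """c is in C_q iff c[q] ->* q_f for some final state q_f."""
    return bool(reach(context, rules, frozenset({q})) & finals)


def equation_holds(context, tree, rules, finals):
    """Compare both sides of t^{-1}L = union of C_q over states reached by t."""
    lhs = in_residual(context, tree, rules, finals)
    rhs = any(in_state_contexts(context, q, rules, finals)
              for q in reach(tree, rules))
    return lhs == rhs


# Hypothetical example: L = { f(a, t) | t is any tree over {f, a, b} }.
rules = {
    ("a", ()): {"qa", "q"},
    ("b", ()): {"q"},
    ("f", ("q", "q")): {"q"},
    ("f", ("qa", "q")): {"qf"},
}
finals = {"qf"}
trees = [("a",), ("b",), ("f", ("a",), ("b",))]
contexts = [HOLE, ("f", HOLE, ("b",)), ("f", ("a",), HOLE)]
all_agree = all(equation_holds(c, t, rules, finals)
                for c in contexts for t in trees)
```

Such a check only samples finitely many trees and contexts, so it illustrates
the statement and does not replace the proof.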

A.2 Proof of the theorem

Theorem 5. The canonical ↑-RFTA recognizing a regular tree language is the
smallest ↑-RFTA which recognizes it. Therefore, ↑-RFTA admit a unique and
minimal representation.

The first point we have to demonstrate in this theorem is that the canonical
↑-RFTA that we have defined recognizes $L$.

Before this demonstration, we need to establish two properties of residual 
languages: 

Lemma 3. Let $L$ be a regular language.

$$(\forall i,\ 1 \le i \le n,\ t_i^{-1}L \subseteq {t'_i}^{-1}L) \implies f(t_1, \ldots, t_n)^{-1}L \subseteq f(t'_1, \ldots, t'_n)^{-1}L$$

Proof. This lemma can be proven inductively on $i$. Let $t_1, \ldots, t_n$ and
$t'_1, \ldots, t'_n$ be such that for all $i$, $t_i^{-1}L$ is a subset of ${t'_i}^{-1}L$.
Let $c \in f(t_1, \ldots, t_n)^{-1}L$. Let us assume that
$c[f(t'_1, \ldots, t'_{i-1}, t_i, \ldots, t_n)] \in L$. This implies that
$c[f(t'_1, \ldots, t'_{i-1}, \diamond, t_{i+1}, \ldots, t_n)] \in t_i^{-1}L$, and therefore
$c[f(t'_1, \ldots, t'_{i-1}, \diamond, t_{i+1}, \ldots, t_n)] \in {t'_i}^{-1}L$.

So $c[f(t'_1, \ldots, t'_i, t_{i+1}, \ldots, t_n)] \in L$. Inductively, $c[f(t'_1, \ldots, t'_n)] \in L$.

So $f(t_1, \ldots, t_n)^{-1}L \subseteq f(t'_1, \ldots, t'_n)^{-1}L$.

□ 

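Lemma 4.

$$\Big(\forall i,\ 1 \le i \le n,\ t_i^{-1}L = \bigcup_{j} t_{i,j}^{-1}L\Big) \implies f(t_1, \ldots, t_n)^{-1}L = \bigcup_{j_1 \ldots j_n} f(t_{1,j_1}, \ldots, t_{n,j_n})^{-1}L$$

Here, $\bigcup_{j_1 \ldots j_n}$ has to be understood as 'the union over all the possible
combinations of $j_1 \ldots j_n$'.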

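Proof. Let $t_1, \ldots, t_n$ and, for all $i \le n$, $t_{i,1}, \ldots, t_{i,m_i}$ be such that
$t_i^{-1}L = \bigcup_{1 \le j \le m_i} t_{i,j}^{-1}L$.

$$\forall t_{1,j_1} \ldots t_{n,j_n},\ \forall i \le n,\ t_{i,j_i}^{-1}L \subseteq t_i^{-1}L$$
$$\implies \forall t_{1,j_1} \ldots t_{n,j_n},\ f(t_{1,j_1}, \ldots, t_{n,j_n})^{-1}L \subseteq f(t_1, \ldots, t_n)^{-1}L \quad \text{(Lemma 3)}$$
$$\implies \bigcup_{j_1 \ldots j_n} f(t_{1,j_1}, \ldots, t_{n,j_n})^{-1}L \subseteq f(t_1, \ldots, t_n)^{-1}L$$

Now, let $c \in f(t_1, \ldots, t_n)^{-1}L$.

$$c[f(t_1, \ldots, t_n)] \in L \implies c[f(\diamond, t_2, \ldots, t_n)] \in t_1^{-1}L$$

As $t_1^{-1}L = \bigcup_j t_{1,j}^{-1}L$, there exists an index $k_1 \le m_1$ such that
$c[f(\diamond, t_2, \ldots, t_n)] \in t_{1,k_1}^{-1}L$.

So $c[f(t_{1,k_1}, t_2, \ldots, t_n)] \in L$.

It can be proven inductively on $i$ that there exist $t_{1,k_1}, \ldots, t_{n,k_n}$ such that
$c[f(t_{1,k_1}, \ldots, t_{n,k_n})] \in L$. So $c \in f(t_{1,k_1}, \ldots, t_{n,k_n})^{-1}L$. So:

$$f(t_1, \ldots, t_n)^{-1}L \subseteq \bigcup_{j_1 \ldots j_n} f(t_{1,j_1}, \ldots, t_{n,j_n})^{-1}L$$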

□ 

Now, we can prove inductively this lemma, which is the main step to prove
the equality between $L$ and $L(A_{can})$.

Lemma 5.

$$\forall t,\ t^{-1}L = \bigcup_{t \to^*_{A_{can}} q} t_q^{-1}L$$

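Proof. Let us prove this lemma inductively. Let $h(t)$ be the height of $t$.

Let us assume that $h(t) = 1$, so $t = a$ where $a$ is a constant. A residual is a
union of prime residuals, so:

$$a^{-1}L = \bigcup_{t_q^{-1}L \subseteq a^{-1}L} t_q^{-1}L$$

As $t_q^{-1}L \subseteq a^{-1}L$ if and only if $A_{can}$ contains the rule $a \to q$:

$$a^{-1}L = \bigcup_{a \to^*_{A_{can}} q} t_q^{-1}L$$

Now let us assume that the lemma is true for any term $t$ such that $h(t) \le k$.
Let $t = f(t_1, \ldots, t_n)$ such that $h(t) = k + 1$.

$$h(t) = k + 1 \implies \forall i \le n,\ h(t_i) \le k \implies \forall i,\ t_i^{-1}L = \bigcup_{t_i \to^*_{A_{can}} q_{i,j}} t_{q_{i,j}}^{-1}L$$
$$\implies t^{-1}L = \bigcup_{j_1 \ldots j_n} f(t_{q_{1,j_1}}, \ldots, t_{q_{n,j_n}})^{-1}L \quad \text{(Lemma 4)}$$

Any residual is a union of prime residuals, so for all $j_1 \ldots j_n$:

$$f(t_{q_{1,j_1}}, \ldots, t_{q_{n,j_n}})^{-1}L = \bigcup_{t_q^{-1}L \subseteq f(t_{q_{1,j_1}}, \ldots, t_{q_{n,j_n}})^{-1}L} t_q^{-1}L$$

So:

$$t^{-1}L = \bigcup_{j_1 \ldots j_n} \Big( \bigcup_{t_q^{-1}L \subseteq f(t_{q_{1,j_1}}, \ldots, t_{q_{n,j_n}})^{-1}L} t_q^{-1}L \Big)$$

As $t_q^{-1}L \subseteq f(t_{q_{1,j_1}}, \ldots, t_{q_{n,j_n}})^{-1}L$ if and only if $A_{can}$ contains the rule
$f(q_{1,j_1}, \ldots, q_{n,j_n}) \to q$,

$$t^{-1}L = \bigcup_{t \to^*_{A_{can}} q} t_q^{-1}L$$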

□ 

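The equality between $L$ and $L(A_{can})$ is formalized as such:

Lemma 6. The canonical ↑-RFTA $A_{can}$ of a language $L$ recognizes $L$, that is:

$$\exists q_f \in Q_f,\ t \to^*_{A_{can}} q_f \iff t \in L$$

Proof. Let $t \in T(\mathcal{F})$ and $q_f \in Q_f$ such that $t \to^*_{A_{can}} q_f$.

As $\diamond \in t_{q_f}^{-1}L$, $\diamond \in t^{-1}L$, so $t \in L$.

Let $t \in L$.

$$\diamond \in t^{-1}L \implies \exists q_j \mid t \to^*_{A_{can}} q_j \wedge \diamond \in t_{q_j}^{-1}L$$
$$\implies \exists q_j \mid t \to^*_{A_{can}} q_j \wedge q_j \in Q_f$$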

Now, we have to prove that the canonical ↑-RFTA is a ↑-RFTA. In order to
do this, we need to establish this lemma:

Lemma 7. Let $t_q^{-1}L$ and $t_{q'}^{-1}L$ be prime bottom-up residual languages of $L$. Let
$C_q^{A_{can}}$ and $C_{q'}^{A_{can}}$ be the sets of contexts accepted by $q$ and $q'$ in the
canonical automaton $A_{can}$ of $L$. Then:

$$t_{q'}^{-1}L \subseteq t_q^{-1}L \implies C_{q'}^{A_{can}} \subseteq C_q^{A_{can}}$$

Proof. Let $t_q$ and $t_{q'}$ be such that $t_{q'}^{-1}L \subseteq t_q^{-1}L$. For all $t_{q_1}, \ldots, t_{q_n}$,
$f(t_{q_1}, \ldots, t_{q'}, \ldots, t_{q_n})^{-1}L \subseteq f(t_{q_1}, \ldots, t_q, \ldots, t_{q_n})^{-1}L$ (Lemma 3).

The construction of the set of rules of the canonical automaton implies that:

$$f(q_1, \ldots, q_n) \to q' \in \Delta \iff t_{q'}^{-1}L \subseteq f(t_{q_1}, \ldots, t_{q_n})^{-1}L$$

So:

$$f(q_1, \ldots, q', \ldots, q_n) \to q'' \in \Delta$$
$$\implies t_{q''}^{-1}L \subseteq f(t_{q_1}, \ldots, t_{q'}, \ldots, t_{q_n})^{-1}L$$
$$\implies t_{q''}^{-1}L \subseteq f(t_{q_1}, \ldots, t_q, \ldots, t_{q_n})^{-1}L$$
$$\implies f(q_1, \ldots, q, \ldots, q_n) \to q'' \in \Delta$$

So each context accepted by $q'$ is accepted by $q$.

So $C_{q'}^{A_{can}} \subseteq C_q^{A_{can}}$.

Lemma 8. The canonical RFTA $A_{can}$ of a language $L$ is a residual finite tree
automaton.

Proof. Let $t_q^{-1}L$ be a prime residual language of $L$. Thanks to Lemma 5:

$$t_q^{-1}L = \bigcup_{t_q \to^*_{A_{can}} q'} t_{q'}^{-1}L$$

If $t_q^{-1}L$ strictly contained all the $t_{q'}^{-1}L$ of the union, it would be composite.
As it is prime, $t_q^{-1}L$ is itself an element of this union, so $t_q \to^*_{A_{can}} q$.
The equation proved in A.1 tells us that:

$$t_q^{-1}L = \bigcup_{t_q \to^*_{A_{can}} q'} C_{q'}$$

So $C_q \subseteq t_q^{-1}L$.

For all $q'$ such that $t_q \to^*_{A_{can}} q'$, $t_{q'}^{-1}L \subseteq t_q^{-1}L$, so $C_{q'} \subseteq C_q$ (Lemma 7).
As the union of all $C_{q'}$ is equal to $t_q^{-1}L$, $t_q^{-1}L \subseteq C_q$.

So $t_q^{-1}L = C_q$, so every prime residual language is accepted by its
corresponding state.

So $A_{can}$ is a RFTA.

□ 

Lemma 9. The canonical RFTA $A_{can}$ of a language $L$ is the smallest RFTA
which recognizes $L$.

Proof. Let $A_{can}$ be the canonical RFTA of a language $L$, and $t$ such that
$\nexists q \in Q,\ t^{-1}L = C_q$.

Thanks to Lemma 5, $t^{-1}L = \bigcup_{t \to^*_{A_{can}} q} t_q^{-1}L$. As $\nexists q \in Q,\ t^{-1}L = t_q^{-1}L$,
$t^{-1}L$ is a union of residuals that it strictly contains. So $t^{-1}L$ is a
composite residual.

So for every prime residual $t^{-1}L$, there is a $q$ such that $t^{-1}L = C_q$. $A_{can}$
contains as many states as there are prime residuals of $L$, so it is the smallest
RFTA which recognizes $L$.

□
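
The distinction between prime and composite residuals used in these lemmas can
be illustrated with a short Python sketch. It rests on a strong simplifying
assumption that is not in the paper: each residual language is represented by a
finite `frozenset` (for instance, a truncation to contexts of bounded size),
whereas an actual decision procedure would compare regular tree languages
through automata. The function names are hypothetical.

```python
def is_composite(residual, all_residuals):
    """A residual is composite if it equals the union of the residuals it
    strictly contains. With this convention the empty residual is composite,
    being the union of an empty family."""
    strictly_smaller = [r for r in all_residuals if r < residual]
    return frozenset().union(*strictly_smaller) == residual


def prime_residuals(all_residuals):
    """Residuals that are not composite; one canonical state per element."""
    return {r for r in all_residuals if not is_composite(r, all_residuals)}


# Hypothetical finite stand-ins for four residual languages.
r1, r2 = frozenset({1}), frozenset({2})
r3, r4 = frozenset({1, 2}), frozenset({1, 2, 3})
primes = prime_residuals({r1, r2, r3, r4})
```

Here `r3` is the union of `r1` and `r2` and is therefore composite, while `r4`
contains an element that no smaller residual provides and is prime, so a
canonical automaton for this family would have three states.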

A.3 Proof of the theorem

Theorem 6. Let $L$ be a language recognized by a ↓-RFTA. The canonical
top-down residual tree automaton of $L$ is the smallest ↓-RFTA which recognizes $L$.

In order to prove this theorem, let us first prove these lemmas:

Lemma 10. Let $A = (Q, \mathcal{F}, I, \Delta)$ be a ↓-RFTA which recognizes $L$. For any
prime residual $c^{-1}L$, there exists a state $q \in Q$ such that $L_q = c^{-1}L$.

Proof. Let $c$ be a context of $L$ such that $\nexists q \in Q,\ c^{-1}L = L_q$.

An earlier lemma implies that $c^{-1}L = \bigcup_{q \in Q'} L_q$ for some $Q' \subseteq Q$, and none of
these $L_q$ are equal to $c^{-1}L$. As $\forall q \in Q,\ L_q = c_q^{-1}L$, we have
$c^{-1}L = \bigcup_{q \in Q'} c_q^{-1}L$ where none of the $c_q^{-1}L$ are equal to $c^{-1}L$.
So $c^{-1}L$ is composite.

□

Now, let us carry out the main part of the demonstration: let us prove that each
prime residual language is exactly accepted by a state of the canonical ↓-RFTA.

Lemma 11. Let $L$ be a language recognized by a ↓-RFTA. Let $A_{can} = (Q, \mathcal{F}, I, \Delta)$
be its canonical automaton. For all $q$ in $Q$, $c_q^{-1}L = L_q$.

Proof. As seen in the definition, $Q$ is in bijection with the set of all prime
residual languages, so for all $q$ there exists a corresponding $c_q^{-1}L$. Let us
prove inductively on the height of $t$ that $t \in c_q^{-1}L \iff t \in L_{A_{can},q}$. Let us
call $H(n)$ this hypothesis when $h(t) \le n$.

First, let us prove $H(1)$.

Let $t$ be such that $h(t) = 1$ and $t \in c_q^{-1}L$. As $h(t) = 1$, $t = a$ where $a$ is a
constant. As $t \in c_q^{-1}L$, $c_q[a] \in L$. So $\Delta$ contains the rule $q(a) \to a$, so
$t \in L_{A_{can},q}$. Reciprocally, $t \in L_{A_{can},q}$ where $t = a$ implies that $\Delta$ contains
the rule $q(a) \to a$. This rule exists in the canonical automaton if and only if
$a$ is a constant and $c_q[a] \in L$. So $c_q[a] \in L$, so $t \in c_q^{-1}L$.

Now, let us assume that $H(l)$ is true when $l < k$. Let us prove that $H(k)$ is
true.

Let $t = f(t_1, \ldots, t_n) \in c_q^{-1}L$ such that $h(t) = k$. For all $t_i$ where $1 \le i \le n$,
$t_i \in (c_q[f(t_1, \ldots, t_{i-1}, \diamond, t_{i+1}, \ldots, t_n)])^{-1}L$.

Now, let us consider $A' = (Q', \mathcal{F}, I', \Delta')$, a ↓-RFTA which recognizes $L$. As $L$
is recognized by a ↓-RFTA, $A'$ exists. We will use this automaton to prove the
existence of a rule $q(f) \to f(q_1, \ldots, q_n)$ such that for all $i$,
$q_i(t_i) \to^*_{A_{can}} t_i$ in $A_{can}$.

As $c_q^{-1}L$ is prime, there exists a $q' \in Q'$ such that $L_{A',q'} = c_q^{-1}L$ (Lemma 10).
As $t \in c_q^{-1}L$, there exists in $\Delta'$ a rule $q'(f) \to f(q'_1, \ldots, q'_n)$ such that for
all $i$, $1 \le i \le n$, we have $t_i \in L_{A',q'_i}$.

For all $t'_1, \ldots, t'_n$ such that $t'_i \in L_{A',q'_i}$, $f(t'_1, \ldots, t'_n) \in c_q^{-1}L$.

As $L_{A',q'_i}$ is a residual, it is either a prime residual or a composite residual.
If it is a prime residual, there exists a $q_i \in Q$ such that $L_{A',q'_i} = c_{q_i}^{-1}L$ and
$t_i \in c_{q_i}^{-1}L$. If it is a composite residual, there exists a $q_i \in Q$ such that
$c_{q_i}^{-1}L \subseteq L_{A',q'_i}$ and $t_i \in c_{q_i}^{-1}L$.

So there exist $q_1, \ldots, q_n$ such that $t_i \in c_{q_i}^{-1}L \subseteq L_{A',q'_i}$. So for all
$t'_1, \ldots, t'_n$ in $c_{q_1}^{-1}L, \ldots, c_{q_n}^{-1}L$, $f(t'_1, \ldots, t'_n) \in L_{A',q'} = c_q^{-1}L$.
So the rule $q(f) \to f(q_1, \ldots, q_n)$ exists in $\Delta$.

For all $t_i$, $h(t_i) < k$, so as we have assumed that $H(l)$ holds when $l < k$,
$H(h(t_i))$ holds. So for all $i$, $t_i \in L_{A_{can},q_i}$. As $q(f) \to f(q_1, \ldots, q_n)$,
$t \in L_{A_{can},q}$.

We have proven that $t \in c_q^{-1}L \implies t \in L_{A_{can},q}$. Now let us prove that
$t \in L_{A_{can},q} \implies t \in c_q^{-1}L$.

Let $t = f(t_1, \ldots, t_n) \in L_{A_{can},q}$ such that $h(t) = k$.

There exist $q_1, \ldots, q_n$ such that
$q(f(t_1, \ldots, t_n)) \to^*_{A_{can}} f(q_1(t_1), \ldots, q_n(t_n)) \to^*_{A_{can}} f(t_1, \ldots, t_n)$.

For all $i$, $t_i \in L_{A_{can},q_i}$ and $h(t_i) < k$, so $H(h(t_i))$ is assumed to be true, so
$t_i \in c_{q_i}^{-1}L$. The existence of the rule $q(f) \to f(q_1, \ldots, q_n)$ in $\Delta$ implies
that for all $t'_1, \ldots, t'_n$ such that $t'_i \in c_{q_i}^{-1}L$, $c_q[f(t'_1, \ldots, t'_n)] \in L$.
So $t \in c_q^{-1}L$.

So $H(k)$ is true. We have proven inductively that for any $t$,
$t \in L_{A_{can},q} \iff t \in c_q^{-1}L$.

□

Lemma 12. $A_{can} = \langle Q, \mathcal{F}, Q_I, \Delta \rangle$ is a ↓-RFTA, recognizes $L$, and is
minimal in number of states.

Proof. Let us prove that Lemma 11 implies that $L(A_{can}) = L$. Let $t \in L$.
$\diamond^{-1}L = L$ is a residual, so it is a union of prime residuals. So there exists
$q_i \in Q$ such that $t \in c_{q_i}^{-1}L$ and $c_{q_i}^{-1}L \subseteq L$. As $c_{q_i}^{-1}L = L_{A_{can},q_i}$, we
have $t \in L_{A_{can},q_i}$. $c_{q_i}^{-1}L \subseteq L$, so $q_i$ is initial, so $t \in L(A_{can})$.

Reciprocally, let $t \in L(A_{can})$. There exists a $q_i \in I$ such that $t \in L_{A_{can},q_i}$.
$c_{q_i}^{-1}L = L_{q_i}$, so $t \in c_{q_i}^{-1}L$. As $q_i$ is initial, $c_{q_i}^{-1}L$ is a subset of $L$.
So $t \in L$. So $L = L(A_{can})$.

So $A_{can}$ recognizes $L$. For any $q$, $L_q = c_q^{-1}L$, so $A_{can}$ is a RFTA. For any
prime residual of $L$, there exists a state in the RFTA which recognizes it. As
there is one state per prime residual in $A_{can}$, $A_{can}$ is minimal in number of
states.

□ 
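
The state languages and the recognized language used in this proof can be
computed directly from top-down rules of the form $q(f) \to f(q_1, \ldots, q_n)$.
The Python sketch below is illustrative only: the dictionary encoding of rules,
the tuple encoding of trees, and the names `in_state_language` and `accepts`
are assumptions of the sketch, and the example automaton is a hypothetical one
chosen for its small size.

```python
def in_state_language(tree, q, rules):
    """True if q(tree) ->* tree, i.e. tree belongs to the state language L_q.

    rules: dict mapping (state, symbol) -> set of tuples of child states;
           a constant a with rule q(a) -> a is encoded as {()}.
    """
    sym, *kids = tree
    for child_states in rules.get((q, sym), ()):
        if len(child_states) == len(kids) and all(
            in_state_language(k, qi, rules)
            for k, qi in zip(kids, child_states)
        ):
            return True
    return False


def accepts(tree, initial, rules):
    """L(A) is the union of the state languages of the initial states."""
    return any(in_state_language(tree, q, rules) for q in initial)


# Hypothetical example: L = { f(a, b), f(b, a) }.
rules = {
    ("q0", "f"): {("qa", "qb"), ("qb", "qa")},
    ("qa", "a"): {()},
    ("qb", "b"): {()},
}
initial = {"q0"}
fab = ("f", ("a",), ("b",))
fba = ("f", ("b",), ("a",))
faa = ("f", ("a",), ("a",))
```

In this example the state language of `qa` is $\{a\}$, which is the top-down
residual of $L$ for the context $f(\diamond, b)$, and symmetrically for `qb`; the
state language of `q0` is $L$ itself, the residual for the empty context.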



